home *** CD-ROM | disk | FTP | other *** search
Text File | 1988-10-17 | 35.4 KB | 1,522 lines |
- Reply-To: oz@yetti.UUCP (Ozan Yigit)
- Subject: v07i021: Ed(1)/regex(3)-compatible reg. exp. package
- Newsgroups: mod.sources
- Approved: mirror!rs
-
- Submitted by: ihnp4!yetti!oz (Ozan Yigit)
- Mod.sources: Volume 7, Issue 21
- Archive-name: regex
-
- #!/bin/sh
- # This is a shell archive, meaning:
- # 1. Remove everything above the #!/bin/sh line.
- # 2. Save the resulting text in a file.
- # 3. Execute the file with /bin/sh (not csh) to create the files:
- # README.txt
- # regex.3
- # regex.c
- # re_fail.c
- # grep.c
- # makefile
- # This archive created: Thu Aug 14 17:41:27 1986
- export PATH; PATH=/bin:$PATH
- echo shar: extracting "'README.txt'" '(1925 characters)'
- if test -f 'README.txt'
- then
- echo shar: over-writing existing file "'README.txt'"
- fi
- sed 's/^X//' << \SHAR_EOF > 'README.txt'
- XThis is a pre-release of a bunch of berkelix regex(3)/ed(1)
- Xcompatible regular-expression routines.
- X
- XThese routines are completely public domain. You can do whatever
- Xyou like with them, and hopefully you are professional enough not
- Xto strip out the authorship information, acknowledgements and
- Xreferences.
- X
- XThe reason for this being a *pre-release* is that I received a lot
- Xof useful suggestions about packaging, about additional routines etc.
- Xfrom a few people. I do not have too much time to do those changes
- Xright now, so I am putting this package out for those who needed
- Xit yesterday. Next release will include other routines, and better
- Xpackaging.
- X
- XThese routines are *not* tested under SysV, but they are tested
- Xunder PRO/Venix (2.0) and BSD 4.2.
- X
- XThose uEmacs V30 hackers should take notice of this package,
- Xas well as those who would love to replace the "restricted" regex
- Xroutines in Jove. [Since parts of these routines are based on Conroy's
- Xoriginal work, it would be only fitting to put these into uEmacs.]
- X
- XIn general, these routines run just as fast, or faster than regex library
- Xroutines under BSD 4.2. In some cases, they are slightly slower. I did not
- Xtry too hard to optimize the re_exec routine.
- X
- XCoding style is a la K&R, with lotsa short identifiers. I like it
- Xthat way. All flames should be fed to yetti!dragon.
- X
- XAcknowledgements: Henry Spencer, Hugh Redelmeier and Drew Sullivan made
- X a lot of important suggestions, some of which will be
- X incorporated into the next version.
- X
- X
- XPackage:
- X ----------------------------------------------------
- X README.txt this file.
- X
- X regex.3 comprehensive man page
- X regex.c goodies
- X
- X re_fail.c sample failure routine
- X
- X grep.c a tiny grep for timing tests
- X
- X makefile guess.
- X ----------------------------------------------------
- X
- Xenjoy.. oz [...!utzoo!yetti!oz]
- X [oz@yusol.BITNET || oz@yuyetti.BITNET]
- X
- X Dept. of Computer Science
- X York University
- SHAR_EOF
- if test 1925 -ne "`wc -c 'README.txt'`"
- then
- echo shar: error transmitting "'README.txt'" '(should have been 1925 characters)'
- fi
- echo shar: extracting "'regex.3'" '(7847 characters)'
- if test -f 'regex.3'
- then
- echo shar: over-writing existing file "'regex.3'"
- fi
- sed 's/^X//' << \SHAR_EOF > 'regex.3'
- X.TH regex 3 local
- X.DA Jun 19 1986
- X.SH NAME
- Xre_comp, re_exec, re_subs, re_modw, re_fail \- regular expression handling
- X.SH ORIGIN
- XDept. of Computer Science
- X.br
- XYork University
- X.SH SYNOPSIS
- X.B char *re_comp(pat)
- X.br
- X.B char *pat;
- X.PP
- X.B re_exec(str)
- X.br
- X.B char *str;
- X.PP
- X.B re_subs(src, dst)
- X.br
- X.B char *src;
- X.br
- X.B char *dst;
- X.PP
- X.B void re_fail(msg, op)
- X.br
- X.B char *msg;
- X.br
- X.B char op;
- X.PP
- X.B void re_modw(str)
- X.br
- X.B char *str;
- X
- X.SH DESCRIPTION
- X.PP
- XThese functions implement
- X.IR ed (1)-style
- Xpartial regular expressions and supporting facilities.
- X.PP
- X.I Re_comp
- Xcompiles a pattern string into an internal form (a deterministic finite-state
- Xautomaton) to be executed by
- X.I re_exec
- Xfor pattern matching.
- X.I Re_comp
- Xreturns 0 if the pattern is compiled successfully, otherwise it returns an
- Xerror message string. If
- X.I re_comp
- Xis called with a 0 or a \fInull\fR string, it returns without changing the
- Xcurrently compiled regular expression.
- X.sp
- X.I Re_comp
- Xsupports the same limited set of
- X.I regular expressions
- Xfound in
- X.I ed
- Xand Berkeley
- X.IR regex (3)
- Xroutines:
- X.sp
- X.if n .in +1.6i
- X.if t .in +1i
- X.de Ti
- X.if n .ti -1.6i
- X.if t .ti -1i
- X..
- X.if n .ta 0.8i +0.8i +0.8i
- X.if t .ta 0.5i +0.5i +0.5i
- X.Ti
- X[1] \fIchar\fR Matches itself, unless it is a special
- Xcharacter (meta-character): \fB. \\ [ ] * + ^ $\fR
- X
- X.Ti
- X[2] \fB.\fR Matches \fIany\fR character.
- X
- X.Ti
- X[3] \fB\\\fR Matches the character following it, except
- Xwhen followed by a digit 1 to 9, \fB(\fR, fB)\fR, \fB<\fR or \fB>\fR.
- X(see [7], [8] and [9]) It is used as an escape character for all
- Xother meta-characters, and itself. When used
- Xin a set ([4]), it is treated as an ordinary
- Xcharacter.
- X
- X.Ti
- X[4] \fB[\fIset\fB]\fR Matches one of the characters in the set.
- XIf the first character in the set is \fB^\fR,
- Xit matches a character NOT in the set. A
- Xshorthand
- X.IR S - E
- Xis used to specify a set of
- Xcharacters
- X.I S
- Xup to
- X.IR E ,
- Xinclusive. The special
- Xcharacters \fB]\fR and \fB-\fR have no special
- Xmeaning if they appear as the first chars
- Xin the set.
- X.nf
- X examples: match:
- X [a-z] any lowercase alpha
- X [^]-] any char except ] and -
- X [^A-Z] any char except
- X uppercase alpha
- X [a-zA-Z0-9] any alphanumeric
- X.fi
- X
- X.Ti
- X[5] \fB*\fR Any regular expression form [1] to [4], followed by
- Xclosure char (*) matches zero or more matches of
- Xthat form.
- X
- X.Ti
- X[6] \fB+\fR Same as [5], except it matches one or more.
- X
- X.Ti
- X[7] A regular expression in the form [1] to [10], enclosed
- Xas \\(\fIform\fR\\) matches what form matches. The enclosure
- Xcreates a set of tags, used for [8] and for
- Xpattern substitution in
- X.I re_subs.
- XThe tagged forms are numbered
- Xstarting from 1.
- X
- X.Ti
- X[8] A \\ followed by a digit 1 to 9 matches whatever a
- Xpreviously tagged regular expression ([7]) matched.
- X
- X.Ti
- X[9] \fB\\<\fR Matches the beginning of a \fIword\fR,
- Xthat is, an empty string followed by a
- Xletter, digit, or _ and not preceded by
- Xa letter, digit, or _ .
- X.Ti
- X \fB\\>\fR Matches the end of a \fIword\fR,
- Xthat is, an empty string preceded
- Xby a letter, digit, or _ , and not
- Xfollowed by a letter, digit, or _ .
- X
- X.Ti
- X[10] A composite regular expression
- X\fIxy\fR where \fIx\fR and \fIy\fR
- Xare in the form of [1] to [10] matches the longest
- Xmatch of \fIx\fR followed by a match for \fIy\fR.
- X
- X.Ti
- X[11] \fB^ $\fR a regular expression starting with a \fB^\fR character
- Xand/or ending with a \fB$\fR character, restricts the
- Xpattern matching to the beginning of the line,
- Xand/or the end of line [anchors]. Elsewhere in the
- Xpattern, \fB^\fR and \fB$\fR are treated as ordinary characters.
- X.if n .in -1.6i
- X.if t .in -1i
- X
- X.PP
- X.I Re_exec
- Xexecutes the internal form produced by
- X.I re_comp
- Xand searches the argument string for the regular expression described
- Xby the internal
- Xform.
- X.I Re_exec
- Xreturns 1 if the last regular expression pattern is matched within the string,
- X0 if no match is found. In case of an internal error (corrupted internal
- Xform),
- X.I re_exec
- Xcalls the user-supplied
- X.I re_fail
- Xand returns 0.
- X.PP
- XThe strings passed to both
- X.I re_comp
- Xand
- X.I re_exec
- Xmay have trailing or embedded newline characters. The strings
- Xmust be terminated by nulls.
- X.PP
- X.I Re_subs
- Xdoes
- X.IR ed -style
- Xpattern substitution, after a successful match is found by
- X.I re_exec.
- XThe source string parameter to
- X.I re_subs
- Xis copied to the destination string with the following interpretation;
- X.sp
- X.if n .in +1.6i
- X.if t .in +1i
- X.Ti
- X[1] & Substitute the entire matched string in the destination.
- X
- X.Ti
- X[2] \\\fIn\fR Substitute the substring matched by a tagged subpattern
- Xnumbered \fIn\fR, where \fIn\fR is between 1 to 9, inclusive.
- X
- X.Ti
- X[3] \\\fIchar\fR Treat the next character literally,
- Xunless the character is a digit ([2]).
- X.if n .in -1.6i
- X.if t .in -1i
- X
- X.PP
- XIf the copy operation with the substitutions is successful,
- X.I re_subs
- Xreturns 1.
- XIf the source string is corrupted, or the last call to
- X.I re_exec
- Xfails, it returns 0.
- X
- X.I Re_modw
- Xis used to
- Xadd new characters into an internal table to
- Xchange the re_exec's understanding of what
- Xa \fIword\fR should look like, when matching with \fB\\<\fR and \fB\\>\fR
- Xconstructs. If the string parameter is 0 or null string,
- Xthe table is reset back to the default, which contains \fBA-Z a-z 0-9 _\fR .
- X
- X.I Re_fail
- Xis a user-supplied routine to handle internal errors.
- X.I re_exec
- Xcalls
- X.I re_fail
- Xwith an error message string, and the opcode character that caused the error.
- XThe default
- X.I re_fail
- Xroutine simply prints the message and the opcode character to
- X.I stderr
- Xand invokes
- X.IR exit (2).
- X.SH EXAMPLES
- XIn the examples below, the
- X.I dfaform
- Xdescribes the internal form after the pattern is compiled. For additional
- Xdetails, refer to the sources.
- X.PP
- X.ta 0.5i +0.5i +0.5i
- X.nf
- Xfoo*.*
- X dfaform: CHR f CHR o CLO CHR o END CLO ANY END END
- X matches: \fIfo foo fooo foobar fobar foxx ...\fR
- X
- Xfo[ob]a[rz]
- X dfaform: CHR f CHR o CCL 2 o b CHR a CCL 2 r z END
- X matches: \fIfobar fooar fobaz fooaz\fR
- X
- Xfoo\\\\+
- X dfaform: CHR f CHR o CHR o CHR \\ CLO CHR \\ END END
- X matches: \fIfoo\\ foo\\\\ foo\\\\\\ ...\fR
- X
- X\\(foo\\)[1-3]\\1 (same as foo[1-3]foo, but takes less internal space)
- X dfaform: BOT 1 CHR f CHR o CHR o EOT 1 CCL 3 1 2 3 REF 1 END
- X matches: \fIfoo1foo foo2foo foo3foo\fR
- X
- X\\(fo.*\\)-\\1
- X dfaform: BOT 1 CHR f CHR o CLO ANY END EOT 1 CHR - REF 1 END
- X matches: \fIfoo-foo fo-fo fob-fob foobar-foobar ...\fR
- X.SH DIAGNOSTICS
- X.I Re_comp
- Xreturns one of the following strings if an error occurs:
- X.PP
- X.nf
- X.in +0.5i
- X\fINo previous regular expression,
- XEmpty closure,
- XIllegal closure,
- XCyclical reference,
- XUndetermined reference,
- XUnmatched \e(,
- XMissing ],
- XNull pattern inside \e(\e),
- XNull pattern inside \e<\e>,
- XToo many \e(\e) pairs,
- XUnmatched \e)\fP.
- X.in -0.5i
- X.fi
- X.SH REFERENCES
- X.if n .ta 3i
- X.if t .ta 1.8i
- X.nf
- X\fISoftware tools\fR Kernighan & Plauger
- X\fISoftware tools in Pascal\fR Kernighan & Plauger
- X\fIGrep sources\fR [rsx-11 C dist] David Conroy
- X\fIEd - text editor\fR Unix Programmer's Manual
- X\fIAdvanced editing on Unix\fR B. W. Kernighan
- X\fIRegExp sources\fR Henry Spencer
- X.fi
- X.SH "HISTORY AND NOTES"
- XThese routines are derived from various implementations
- Xfound in
- X.I "Software Tools"
- Xbooks, and David Conroy's
- X.I grep.
- XThey are NOT derived from licensed/restricted software.
- XFor more interesting/academic/complicated implementations,
- Xsee Henry Spencer's
- X.I regexp
- Xroutines (V8), or
- X.I "GNU Emacs"
- Xpattern
- Xmatching module.
- X.PP
- XThe
- X.I re_comp
- Xand
- X.I re_exec
- Xroutines perform
- X.I almost
- Xas well as their licensed counterparts, sometimes better.
- XIn very few instances, they
- Xare about 10% to 15% slower.
- X.SH AUTHOR
- XOzan S. Yigit (oz)
- X.br
- Xusenet: utzoo!yetti!oz
- X.br
- Xbitnet: oz@yusol || oz@yuyetti
- X.SH "SEE ALSO"
- Xed(1), ex(1), egrep(1), fgrep(1), grep(1), regex(3)
- X.SH BUGS
- XThese routines are \fIPublic Domain\fR. You can get them
- Xin source.
- X.br
- XThe internal storage for the \fIdfa form\fR is not checked for
- Xoverflows. Currently, it is 1024 bytes.
- X.br
- XOthers, no doubt.
- SHAR_EOF
- if test 7847 -ne "`wc -c 'regex.3'`"
- then
- echo shar: error transmitting "'regex.3'" '(should have been 7847 characters)'
- fi
- echo shar: extracting "'regex.c'" '(18779 characters)'
- if test -f 'regex.c'
- then
- echo shar: over-writing existing file "'regex.c'"
- fi
- sed 's/^X//' << \SHAR_EOF > 'regex.c'
- X/*
- X * regex - Regular expression pattern matching
- X * and replacement
- X *
- X *
- X * By: Ozan S. Yigit (oz)
- X * Dept. of Computer Science
- X * York University
- X *
- X *
- X * These routines are the PUBLIC DOMAIN equivalents
- X * of regex routines as found in 4.nBSD UN*X, with minor
- X * extensions.
- X *
- X * These routines are derived from various implementations
- X * found in software tools books, and Conroy's grep. They
- X * are NOT derived from licensed/restricted software.
- X * For more interesting/academic/complicated implementations,
- X * see Henry Spencer's regexp routines, or GNU Emacs pattern
- X * matching module.
- X *
- X * Routines:
- X * re_comp: compile a regular expression into
- X * a DFA.
- X *
- X * char *re_comp(s)
- X * char *s;
- X *
- X * re_exec: execute the DFA to match a pattern.
- X *
- X * int re_exec(s)
- X * char *s;
- X *
- X * re_modw change re_exec's understanding of what
- X * a "word" looks like (for \< and \>)
- X * by adding into the hidden word-character
- X * table.
- X *
- X * void re_modw(s)
- X * char *s;
- X *
- X * re_subs: substitute the matched portions in
- X * a new string.
- X *
- X * int re_subs(src, dst)
- X * char *src;
- X * char *dst;
- X *
- X * re_fail: failure routine for re_exec.
- X *
- X * void re_fail(msg, op)
- X * char *msg;
- X * char op;
- X *
- X * Regular Expressions:
- X *
- X * [1] char matches itself, unless it is a special
- X * character (metachar): . \ [ ] * + ^ $
- X *
- X * [2] . matches any character.
- X *
- X * [3] \ matches the character following it, except
- X * when followed by a left or right round bracket,
- X * a digit 1 to 9 or a left or right angle bracket.
- X * (see [7], [8] and [9])
- X * It is used as an escape character for all
- X * other meta-characters, and itself. When used
- X * in a set ([4]), it is treated as an ordinary
- X * character.
- X *
- X * [4] [set] matches one of the characters in the set.
- X * If the first character in the set is "^",
- X * it matches a character NOT in the set. A
- X * shorthand S-E is used to specify a set of
- X * characters S upto E, inclusive. The special
- X * characters "]" and "-" have no special
- X * meaning if they appear as the first chars
- X * in the set.
- X * examples: match:
- X *
- X * [a-z] any lowercase alpha
- X *
- X * [^]-] any char except ] and -
- X *
- X * [^A-Z] any char except uppercase
- X * alpha
- X *
- X * [a-zA-Z] any alpha
- X *
- X * [5] * any regular expression form [1] to [4], followed by
- X * closure char (*) matches zero or more matches of
- X * that form.
- X *
- X * [6] + same as [5], except it matches one or more.
- X *
- X * [7] a regular expression in the form [1] to [10], enclosed
- X * as \(form\) matches what form matches. The enclosure
- X * creates a set of tags, used for [8] and for
- X * pattern substution. The tagged forms are numbered
- X * starting from 1.
- X *
- X * [8] a \ followed by a digit 1 to 9 matches whatever a
- X * previously tagged regular expression ([7]) matched.
- X *
- X * [9] \< a regular expression starting with a \< construct
- X * \> and/or ending with a \> construct, restricts the
- X * pattern matching to the beginning of a word, and/or
- X * the end of a word. A word is defined to be a character
- X * string beginning and/or ending with the characters
- X * A-Z a-z 0-9 and _. It must also be preceded and/or
- X * followed by any character outside those mentioned.
- X *
- X * [10] a composite regular expression xy where x and y
- X * are in the form [1] to [10] matches the longest
- X * match of x followed by a match for y.
- X *
- X * [11] ^ a regular expression starting with a ^ character
- X * $ and/or ending with a $ character, restricts the
- X * pattern matching to the beginning of the line,
- X * or the end of line. [anchors] Elsewhere in the
- X * pattern, ^ and $ are treated as ordinary characters.
- X *
- X *
- X * Acknowledgements:
- X *
- X * HCR's Hugh Redelmeier has been most helpful in various
- X * stages of development. He convinced me to include BOW
- X * and EOW constructs, originally invented by Rob Pike at
- X * the University of Toronto.
- X *
- X * References:
- X * Software tools Kernighan & Plauger
- X * Software tools in Pascal Kernighan & Plauger
- X * Grep [rsx-11 C dist] David Conroy
- X * ed - text editor Un*x Programmer's Manual
- X * Advanced editing on Un*x B. W. Kernighan
- X * RegExp routines Henry Spencer
- X *
- X * Notes:
- X *
- X * This implementation uses a bit-set representation for character
- X * classes for speed and compactness. Each character is represented
- X * by one bit in a 128-bit block. Thus, CCL or NCL always takes a
- X * constant 16 bytes in the internal dfa, and re_exec does a single
- X * bit comparison to locate the character in the set.
- X *
- X * Examples:
- X *
- X * pattern: foo*.*
- X * compile: CHR f CHR o CLO CHR o END CLO ANY END END
- X * matches: fo foo fooo foobar fobar foxx ...
- X *
- X * pattern: fo[ob]a[rz]
- X * compile: CHR f CHR o CCL 2 o b CHR a CCL bitset END
- X * matches: fobar fooar fobaz fooaz
- X *
- X * pattern: foo\\+
- X * compile: CHR f CHR o CHR o CHR \ CLO CHR \ END END
- X * matches: foo\ foo\\ foo\\\ ...
- X *
- X * pattern: \(foo\)[1-3]\1 (same as foo[1-3]foo)
- X * compile: BOT 1 CHR f CHR o CHR o EOT 1 CCL bitset REF 1 END
- X * matches: foo1foo foo2foo foo3foo
- X *
- X * pattern: \(fo.*\)-\1
- X * compile: BOT 1 CHR f CHR o CLO ANY END EOT 1 CHR - REF 1 END
- X * matches: foo-foo fo-fo fob-fob foobar-foobar ...
- X *
- X */
- X
- X#define MAXDFA 1024
- X#define MAXTAG 10
- X
- X#define OKP 1
- X#define NOP 0
- X
- X#define CHR 1
- X#define ANY 2
- X#define CCL 3
- X#define NCL 4
- X#define BOL 5
- X#define EOL 6
- X#define BOT 7
- X#define EOT 8
- X#define BOW 9
- X#define EOW 10
- X#define REF 11
- X#define CLO 12
- X
- X#define END 0
- X
- X/*
- X * The following defines are not meant
- X * to be changeable. They are for readibility
- X * only.
- X *
- X */
- X#define MAXCHR 128
- X#define CHRBIT 8
- X#define BITBLK MAXCHR/CHRBIT
- X#define BLKIND 0170
- X#define BITIND 07
- X
- X#define ASCIIB 0177
- X
- Xtypedef /*unsigned*/ char CHAR;
- X
- Xstatic int tagstk[MAXTAG]; /* subpat tag stack..*/
- Xstatic CHAR dfa[MAXDFA]; /* automaton.. */
- Xstatic int sta = NOP; /* status of lastpat */
- X
- Xstatic CHAR bittab[BITBLK]; /* bit table for CCL */
- X
- Xstatic void
- Xchset(c) register CHAR c; { bittab[((c)&BLKIND)>>3] |= 1<<((c)&BITIND); }
- X
- X#define badpat(x) return(*dfa = END, x)
- X#define store(x) *mp++ = x
- X
- Xchar *
- Xre_comp(pat)
- Xchar *pat;
- X{
- X register char *p; /* pattern pointer */
- X register CHAR *mp=dfa; /* dfa pointer */
- X register CHAR *lp; /* saved pointer.. */
- X register CHAR *sp=dfa; /* another one.. */
- X
- X register int tagi = 0; /* tag stack index */
- X register int tagc = 1; /* actual tag count */
- X
- X register int n;
- X int c1, c2;
- X
- X if (!pat || !*pat)
- X if (sta)
- X return(0);
- X else
- X badpat("No previous regular expression");
- X sta = NOP;
- X
- X for (p = pat; *p; p++) {
- X lp = mp;
- X switch(*p) {
- X
- X case '.': /* match any char.. */
- X store(ANY);
- X break;
- X
- X case '^': /* match beginning.. */
- X if (p == pat)
- X store(BOL);
- X else {
- X store(CHR);
- X store(*p);
- X }
- X break;
- X
- X case '$': /* match endofline.. */
- X if (!*(p+1))
- X store(EOL);
- X else {
- X store(CHR);
- X store(*p);
- X }
- X break;
- X
- X case '[': /* match char class..*/
- X
- X if (*++p == '^') {
- X store(NCL);
- X p++;
- X }
- X else
- X store(CCL);
- X
- X if (*p == '-') /* real dash */
- X chset(*p++);
- X if (*p == ']') /* real brac */
- X chset(*p++);
- X while (*p && *p != ']') {
- X if (*p == '-' && *(p+1) && *(p+1) != ']') {
- X p++;
- X c1 = *(p-2) + 1;
- X c2 = *p++;
- X while (c1 <= c2)
- X chset(c1++);
- X }
- X#ifdef EXTEND
- X else if (*p == '\\' && *(p+1)) {
- X p++;
- X chset(*p++);
- X }
- X#endif
- X else
- X chset(*p++);
- X }
- X if (!*p)
- X badpat("Missing ]");
- X
- X for (n = 0; n < BITBLK; bittab[n++] = (char) 0)
- X store(bittab[n]);
- X
- X break;
- X
- X case '*': /* match 0 or more.. */
- X case '+': /* match 1 or more.. */
- X if (p == pat)
- X badpat("Empty closure");
- X lp = sp; /* previous opcode */
- X if (*lp == CLO) /* equivalence.. */
- X break;
- X switch(*lp) {
- X
- X case BOL:
- X case BOT:
- X case EOT:
- X case BOW:
- X case EOW:
- X case REF:
- X badpat("Illegal closure");
- X default:
- X break;
- X }
- X
- X if (*p == '+')
- X for (sp = mp; lp < sp; lp++)
- X store(*lp);
- X
- X store(END);
- X store(END);
- X sp = mp;
- X while (--mp > lp)
- X *mp = mp[-1];
- X store(CLO);
- X mp = sp;
- X break;
- X
- X case '\\': /* tags, backrefs .. */
- X switch(*++p) {
- X
- X case '(':
- X if (tagc < MAXTAG) {
- X tagstk[++tagi] = tagc;
- X store(BOT);
- X store(tagc++);
- X }
- X else
- X badpat("Too many \\(\\) pairs");
- X break;
- X case ')':
- X if (*sp == BOT)
- X badpat("Null pattern inside \\(\\)");
- X if (tagi > 0) {
- X store(EOT);
- X store(tagstk[tagi--]);
- X }
- X else
- X badpat("Unmatched \\)");
- X break;
- X case '<':
- X store(BOW);
- X break;
- X case '>':
- X if (*sp == BOW)
- X badpat("Null pattern inside \\<\\>");
- X store(EOW);
- X break;
- X case '1':
- X case '2':
- X case '3':
- X case '4':
- X case '5':
- X case '6':
- X case '7':
- X case '8':
- X case '9':
- X n = *p-'0';
- X if (tagi > 0 && tagstk[tagi] == n)
- X badpat("Cyclical reference");
- X if (tagc > n) {
- X store(REF);
- X store(n);
- X }
- X else
- X badpat("Undetermined reference");
- X break;
- X#ifdef EXTEND
- X case 'b':
- X store(CHR);
- X store('\b');
- X break;
- X case 'n':
- X store(CHR);
- X store('\n');
- X break;
- X case 'f':
- X store(CHR);
- X store('\f');
- X break;
- X case 'r':
- X store(CHR);
- X store('\r');
- X break;
- X case 't':
- X store(CHR);
- X store('\t');
- X break;
- X#endif
- X default:
- X store(CHR);
- X store(*p);
- X }
- X break;
- X
- X default : /* an ordinary char */
- X store(CHR);
- X store(*p);
- X break;
- X }
- X sp = lp;
- X }
- X if (tagi > 0)
- X badpat("Unmatched \\(");
- X store(END);
- X sta = OKP;
- X return(0);
- X}
- X
- X
- Xstatic char *bol;
- Xstatic char *bopat[MAXTAG];
- Xstatic char *eopat[MAXTAG];
- Xchar *pmatch();
- X
- X/*
- X * re_exec:
- X * execute dfa to find a match.
- X *
- X * special cases: (dfa[0])
- X * BOL
- X * Match only once, starting from the
- X * beginning.
- X * CHR
- X * First locate the character without
- X * calling pmatch, and if found, call
- X * pmatch for the remaining string.
- X * END
- X * re_comp failed, poor luser did not
- X * check for it. Fail fast.
- X *
- X * If a match is found, bopat[0] and eopat[0] are set
- X * to the beginning and the end of the matched fragment,
- X * respectively.
- X *
- X */
- X
- Xint
- Xre_exec(lp)
- Xregister char *lp;
- X{
- X register char c;
- X register char *ep = 0;
- X register CHAR *ap = dfa;
- X
- X bol = lp;
- X
- X bopat[0] = 0;
- X bopat[1] = 0;
- X bopat[2] = 0;
- X bopat[3] = 0;
- X bopat[4] = 0;
- X bopat[5] = 0;
- X bopat[6] = 0;
- X bopat[7] = 0;
- X bopat[8] = 0;
- X bopat[9] = 0;
- X
- X switch(*ap) {
- X
- X case BOL: /* anchored: match from BOL only */
- X ep = pmatch(lp,ap);
- X break;
- X case CHR: /* ordinary char: locate it fast */
- X c = *(ap+1);
- X while (*lp && *lp != c)
- X lp++;
- X if (!*lp) /* if EOS, fail, else fall thru. */
- X return(0);
- X default: /* regular matching all the way. */
- X while (*lp) {
- X if ((ep = pmatch(lp,ap)))
- X break;
- X lp++;
- X }
- X break;
- X case END: /* munged automaton. fail always */
- X return(0);
- X }
- X if (!ep)
- X return(0);
- X
- X bopat[0] = lp;
- X eopat[0] = ep;
- X return(1);
- X}
- X
- X/*
- X * pmatch:
- X * internal routine for the hard part
- X *
- X * This code is mostly snarfed from an early
- X * grep written by David Conroy. The backref and
- X * tag stuff, and various other mods are by oZ.
- X *
- X * special cases: (dfa[n], dfa[n+1])
- X * CLO ANY
- X * We KNOW ".*" will match ANYTHING
- X * upto the end of line. Thus, go to
- X * the end of line straight, without
- X * calling pmatch recursively. As in
- X * the other closure cases, the remaining
- X * pattern must be matched by moving
- X * backwards on the string recursively,
- X * to find a match for xy (x is ".*" and
- X * y is the remaining pattern) where
- X * the match satisfies the LONGEST match
- X * for x followed by a match for y.
- X * CLO CHR
- X * We can again scan the string forward
- X * for the single char without recursion,
- X * and at the point of failure, we execute
- X * the remaining dfa recursively, as
- X * described above.
- X *
- X * At the end of a successful match, bopat[n] and eopat[n]
- X * are set to the beginning and end of subpatterns matched
- X * by tagged expressions (n = 1 to 9).
- X *
- X */
- X
- Xextern void re_fail();
- X
- X/*
- X * character classification table for word boundary
- X * operators BOW and EOW. the reason for not using
- X * ctype macros is that we can let the user add into
- X * our own table. see re_modw. This table is not in
- X * the bitset form, since we may wish to extend it
- X * in the future for other character classifications.
- X *
- X * TRUE for 0-9 A-Z a-z _
- X */
- Xstatic char chrtyp[MAXCHR] = {
- X 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
- X 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
- X 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
- X 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
- X 0, 0, 0, 0, 0, 0, 0, 0, 1, 1,
- X 1, 1, 1, 1, 1, 1, 1, 1, 0, 0,
- X 0, 0, 0, 0, 0, 1, 1, 1, 1, 1,
- X 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
- X 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
- X 1, 0, 0, 0, 0, 1, 0, 1, 1, 1,
- X 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
- X 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
- X 1, 1, 1, 0, 0, 0, 0, 0
- X };
- X
- X#define inascii(x) (0177&(x))
- X#define iswordc(x) chrtyp[inascii(x)]
- X#define isinset(x,y) ((x)[((y)&BLKIND)>>3] & (1<<((y)&BITIND)))
- X
- X/*
- X * skip values for CLO XXX to skip past the closure
- X *
- X */
- X
- X#define ANYSKIP 2 /* CLO ANY END ... */
- X#define CHRSKIP 3 /* CLO CHR chr END ... */
- X#define CCLSKIP 18 /* CLO CCL 16bytes END ... */
- X
- Xstatic char *
- Xpmatch(lp, ap)
- Xregister char *lp;
- Xregister CHAR *ap;
- X{
- X register char *e; /* extra pointer for CLO */
- X register char *bp; /* beginning of subpat.. */
- X register char *ep; /* ending of subpat.. */
- X register int op, c, n;
- X char *are; /* to save the line ptr. */
- X
- X while ((op = *ap++) != END)
- X switch(op) {
- X
- X case CHR:
- X if (*lp++ != *ap++)
- X return(0);
- X break;
- X case ANY:
- X if (!*lp++)
- X return(0);
- X break;
- X case CCL:
- X c = *lp++;
- X if (!isinset(ap,c))
- X return(0);
- X ap += BITBLK;
- X break;
- X case NCL:
- X c = *lp++;
- X if (isinset(ap,c))
- X return(0);
- X ap += BITBLK;
- X break;
- X case BOL:
- X if (lp != bol)
- X return(0);
- X break;
- X case EOL:
- X if (*lp)
- X return(0);
- X break;
- X case BOT:
- X bopat[*ap++] = lp;
- X break;
- X case EOT:
- X eopat[*ap++] = lp;
- X break;
- X case BOW:
- X if (!(lp!=bol && iswordc(lp[-1])) && iswordc(*lp))
- X break;
- X return(0);
- X case EOW:
- X if ((lp!=bol && iswordc(lp[-1])) && !iswordc(*lp))
- X break;
- X return(0);
- X case REF:
- X n = *ap++;
- X bp = bopat[n];
- X ep = eopat[n];
- X while (bp < ep)
- X if (*bp++ != *lp++)
- X return(0);
- X break;
- X case CLO:
- X are = lp;
- X switch(*ap) {
- X
- X case ANY:
- X while (*lp)
- X lp++;
- X n = ANYSKIP;
- X break;
- X case CHR:
- X c = *(ap+1);
- X while (*lp && c == *lp)
- X lp++;
- X n = CHRSKIP;
- X break;
- X case CCL:
- X case NCL:
- X while (*lp && (e = pmatch(lp, ap)))
- X lp = e;
- X n = CCLSKIP;
- X break;
- X default:
- X re_fail("closure: bad dfa.", *ap);
- X return(0);
- X }
- X
- X ap += n;
- X
- X while (lp >= are) {
- X if (e = pmatch(lp, ap))
- X return(e);
- X --lp;
- X }
- X return(0);
- X default:
- X re_fail("re_exec: bad dfa.", op);
- X return(0);
- X }
- X return(lp);
- X}
- X
- X/*
- X * re_modw:
- X * add new characters into the word table to
- X * change the re_exec's understanding of what
- X * a word should look like. Note that we only
- X * accept additions into the word definition.
- X *
- X * If the string parameter is 0 or null string,
- X * the table is reset back to the default, which
- X * contains A-Z a-z 0-9 _. [We use the compact
- X * bitset representation for the default table]
- X *
- X */
- X
- Xstatic char deftab[16] = {
- X 0, 0, 0, 0, 0, 0, 377, 003, 376, 377, 377, 207,
- X 376, 377, 377, 007
- X};
- X
- Xvoid
- Xre_modw(s)
- Xregister char *s;
- X{
- X register int i;
- X
- X if (!s || !*s) {
- X for (i = 0; i < MAXCHR; i++)
- X if (!isinset(deftab,i))
- X iswordc(i) = 0;
- X }
- X else
- X while(*s)
- X iswordc(*s++) = 1;
- X}
- X
- X/*
- X * re_subs:
- X * substitute the matched portions of the src in
- X * dst.
- X *
- X * & substitute the entire matched pattern.
- X *
- X * \digit substitute a subpattern, with the given
- X * tag number. Tags are numbered from 1 to
- X * 9. If the particular tagged subpattern
- X * does not exist, null is substituted.
- X *
- X */
- Xint
- Xre_subs(src, dst)
- Xregister char *src;
- Xregister char *dst;
- X{
- X register char c;
- X register int pin;
- X register char *bp;
- X register char *ep;
- X
- X if (!*src || !bopat[0])
- X return(0);
- X
- X while (c = *src++) {
- X switch(c) {
- X
- X case '&':
- X pin = 0;
- X break;
- X
- X case '\\':
- X c = *src++;
- X if (c >= '0' && c <= '9') {
- X pin = c - '0';
- X break;
- X }
- X
- X default:
- X *dst++ = c;
- X continue;
- X }
- X
- X if ((bp = bopat[pin]) && (ep = eopat[pin])) {
- X while (*bp && bp < ep)
- X *dst++ = *bp++;
- X if (bp < ep)
- X return(0);
- X }
- X }
- X *dst = (char) 0;
- X return(1);
- X}
- X
- X#ifdef DEBUG
- X/*
- X * symbolic - produce a symbolic dump of the
- X * dfa
- X */
- Xsymbolic(s)
- Xchar *s;
- X{
- X printf("pattern: %s\n", s);
- X printf("dfacode:\n");
- X dfadump(dfa);
- X}
- X
- Xstatic
- Xdfadump(ap)
- XCHAR *ap;
- X{
- X register int n;
- X
- X while (*ap != END)
- X switch(*ap++) {
- X case CLO:
- X printf("CLOSURE");
- X dfadump(ap);
- X switch(*ap) {
- X case CHR:
- X n = CHRSKIP;
- X break;
- X case ANY:
- X n = ANYSKIP;
- X break;
- X case CCL:
- X case NCL:
- X n = CCLSKIP;
- X break;
- X }
- X ap += n;
- X break;
- X case CHR:
- X printf("\tCHR %c\n",*ap++);
- X break;
- X case ANY:
- X printf("\tANY .\n");
- X break;
- X case BOL:
- X printf("\tBOL -\n");
- X break;
- X case EOL:
- X printf("\tEOL -\n");
- X break;
- X case BOT:
- X printf("BOT: %d\n",*ap++);
- X break;
- X case EOT:
- X printf("EOT: %d\n",*ap++);
- X break;
- X case BOW:
- X printf("BOW\n");
- X break;
- X case EOW:
- X printf("EOW\n");
- X break;
- X case REF:
- X printf("REF: %d\n",*ap++);
- X break;
- X case CCL:
- X printf("\tCCL [");
- X for (n = 0; n < MAXCHR; n++)
- X if (isinset(ap,(CHAR)n))
- X printf("%c",n);
- X printf("]\n");
- X ap += BITBLK;
- X break;
- X case NCL:
- X printf("\tNCL [");
- X for (n = 0; n < MAXCHR; n++)
- X if (isinset(ap,(CHAR)n))
- X printf("%c",n);
- X printf("]\n");
- X ap += BITBLK;
- X break;
- X default:
- X printf("bad dfa. opcode %o\n", ap[-1]);
- X exit(1);
- X break;
- X }
- X}
- X#endif
- SHAR_EOF
- if test 18779 -ne "`wc -c 'regex.c'`"
- then
- echo shar: error transmitting "'regex.c'" '(should have been 18779 characters)'
- fi
- echo shar: extracting "'re_fail.c'" '(292 characters)'
- if test -f 're_fail.c'
- then
- echo shar: over-writing existing file "'re_fail.c'"
- fi
- sed 's/^X//' << \SHAR_EOF > 're_fail.c'
- X
- X#ifdef vms
- X#include stdio
- X#else
- X#include <stdio.h>
- X#endif
- X
- X/*
- X * re_fail:
- X * internal error handler for re_exec.
- X *
- X * should probably do something like a
- X * longjump to recover gracefully.
- X */
- Xvoid
- Xre_fail(s, c)
- Xchar *s;
- Xchar c;
- X{
- X fprintf(stderr, "%s [opcode %o]\n", s, c);
- X exit(1);
- X}
- SHAR_EOF
- if test 292 -ne "`wc -c 're_fail.c'`"
- then
- echo shar: error transmitting "'re_fail.c'" '(should have been 292 characters)'
- fi
- echo shar: extracting "'grep.c'" '(1014 characters)'
- if test -f 'grep.c'
- then
- echo shar: over-writing existing file "'grep.c'"
- fi
- sed 's/^X//' << \SHAR_EOF > 'grep.c'
- X#ifdef vms
- X#include stdio
- X#else
- X#include <stdio.h>
- X#endif
- X
- X/*
- X * Rudimentary grep to test regex routines.
- X *
- X * DEBUG is only applicable
- X * to oZ version of regex. Make sure to
- X * compile regex.c with -DDEBUG as well.
- X *
- X */
- Xextern char *re_comp();
- Xextern re_exec();
- X
- Xmain(argc,argv)
- Xchar *argv[];
- X{
- X char str[512];
- X FILE *f;
- X register int n;
- X register char *p;
- X
- X if (argc < 2)
- X error("usage: grep pat [file]");
- X
- X if ((p = re_comp(argv[1])) != 0) {
- X printf("\t%s: %s\n", p, argv[1]);
- X exit(1);
- X }
- X#ifdef DEBUG
- X symbolic(argv[1]);
- X#endif
- X if (p = argv[2]) {
- X if ((f = fopen(p, "r")) == NULL) {
- X printf("cannot open %s\n", argv[2]);
- X exit(1);
- X }
- X while ((n = load(str, f)) != EOF)
- X if (re_exec(str))
- X printf("%s\n",str);
- X
- X }
- X exit(0);
- X}
- Xload (s, f)
- Xchar *s;
- XFILE *f;
- X{
- X register int c;
- X static int lineno = 0;
- X
- X while ((c = getc(f)) != '\n' && c != EOF)
- X *s++ = c;
- X if (c == EOF)
- X return (EOF);
- X *s = (char) 0;
- X return (++lineno);
- X}
- X
- Xerror(s) char *s ;
- X{
- X fprintf(stderr,"%s\n",s);
- X exit(1);
- X}
- SHAR_EOF
- if test 1014 -ne "`wc -c 'grep.c'`"
- then
- echo shar: error transmitting "'grep.c'" '(should have been 1014 characters)'
- fi
- echo shar: extracting "'makefile'" '(2472 characters)'
- if test -f 'makefile'
- then
- echo shar: over-writing existing file "'makefile'"
- fi
- sed 's/^X//' << \SHAR_EOF > 'makefile'
- X#
- X# regex routines (PUBLIC DOMAIN)
- X#
- X# by: Ozan S. Yigit (oz)
- X# Dept. of Computer Science
- X# York University
- X#
- X# Applicable to BSD:
- X#
- X# If you have the generic(?) regex routines
- X# than you can compare the timings of these
- X# routines to the generic ones by:
- X#
- X# make times
- X#
- X# which will create two rudimentary greps
- X# lgrep and ogrep. lgrep will use the generic
- X# regex routines, and ogrep will use oz version
- X# of regex. Several patterns will be searched
- X# in /usr/dict/words, and the output of the greps
- X# will be compared. [for reasons of sanity]
- X#
- X# Surely, you will note, the time tests are somewhat
- X# biased, since /usr/dict/words contain *short* lines,
- X# thereby the real-life case of searching a complex
- X# expression within a long line is not covered. You
- X# will find, however, that the PD regex routines will
- X# search *as fast* as the generic ones in most
- X# cases, and about 10% slower in some cases, when
- X# tested with files containing *long* lines.
- X#
- XCFLAGS = -O
- X#
- X# test patterns
- X#
- XPAT1 = '[d-f]zz*.*m'
- XPAT2 = 'fo[ornt]*.*b[a-d]*'
- XPAT3 = '.th.'
- XPAT4 = '\(ab\)[a-d]*\1'
- XPAT5 = 'burp'
- X
- XFILE = /usr/dict/words
- XOUTD = /tmp/
- X
- XRSRC = regex.o re_fail.o
- X
- Xregex: $(RSRC)
- X#
- X# insert code to put these into a library
- X#
- Xrlint:
- X lint -phc regex.c
- Xdebug:
- X cc -O -DDEBUG -o ogrep grep.c regex.c re_fail.c
- X
- Xlgrep: grep.o
- X cc -o lgrep grep.o
- X
- Xogrep: grep.o $(RSRC)
- X cc -o ogrep grep.o $(RSRC)
- X
- Xtimes: lgrep ogrep
- X @echo generic regex vs oz regex
- X# @echo pattern: $(PAT1)
- X time ogrep $(PAT1) $(FILE) >$(OUTD)ogrep.out
- X time lgrep $(PAT1) $(FILE) >$(OUTD)lgrep.out
- X @echo output differences:
- X -diff $(OUTD)ogrep.out $(OUTD)lgrep.out
- X @echo "---"
- X# @echo pattern: $(PAT2)
- X time ogrep $(PAT2) $(FILE) >$(OUTD)ogrep.out
- X time lgrep $(PAT2) $(FILE) >$(OUTD)lgrep.out
- X @echo output differences:
- X -diff $(OUTD)ogrep.out $(OUTD)lgrep.out
- X @echo "---"
- X# echo pattern: $(PAT3)
- X time ogrep $(PAT3) $(FILE) >$(OUTD)ogrep.out
- X time lgrep $(PAT3) $(FILE) >$(OUTD)lgrep.out
- X @echo output differences:
- X -diff $(OUTD)ogrep.out $(OUTD)lgrep.out
- X @echo "---"
- X# echo pattern: $(PAT4)
- X time ogrep $(PAT4) $(FILE) >$(OUTD)ogrep.out
- X time lgrep $(PAT4) $(FILE) >$(OUTD)lgrep.out
- X @echo output differences:
- X -diff $(OUTD)ogrep.out $(OUTD)lgrep.out
- X @echo "---"
- X# echo pattern: $(PAT5)
- X time ogrep $(PAT5) $(FILE) >$(OUTD)ogrep.out
- X time lgrep $(PAT5) $(FILE) >$(OUTD)lgrep.out
- X @echo output differences:
- X -diff $(OUTD)ogrep.out $(OUTD)lgrep.out
- X @echo "---"
- X @rm $(OUTD)ogrep.out $(OUTD)lgrep.out
- SHAR_EOF
- if test 2472 -ne "`wc -c 'makefile'`"
- then
- echo shar: error transmitting "'makefile'" '(should have been 2472 characters)'
- fi
- # End of shell archive
- exit 0
-
-
-